[RFC][WIP] Common: Add an Initial Chat Memory Interface/Implementation #12698
Conversation
@ggerganov @ngxson If you would be willing, I'd like to hear any thoughts you have. I may dramatically change the backend memory implementation, but I want to make sure the way I'm interacting with main.cpp and server.cpp is reasonable.
@markhpc I am not familiar with the "ChatGPT memories" feature and how it works. And briefly looking at the implementation, I still don't know what it is (excuse me if it is something obvious). But I would go out on a limb and say that most likely this is something we don't want to implement in the …
Agree with @ggerganov. This feature is a cool UX, but it will be very difficult to maintain. I would categorize such features as "prompt engineering" rather than an actual inference feature. Indeed, before ChatGPT even had the memory feature, I implemented this myself in my own private llama.cpp fork using both prompts and the llama_kv shifting API. It worked for a while, but it was very tricky and didn't work with all kinds of models. I think in the future, with the addition of MCP in the server web UI, this could be implemented in a more generic way. All the cool things people talk about (MCP, agents, tool calling, RAG) are just prompt engineering anyway; it's just a matter of how to organize the code.
@ggerganov @ngxson Thank you both for your quick feedback! FWIW, the goal here isn't to replicate ChatGPT's memory feature as a UX layer or purely via prompting. My goal is to introduce an interface for interacting with inference at a deeper level. Right now that means providing access to structured, namespaced data storage (key/value in this case). The demo backend here is just a std::map, but it could easily be sqlite3, S3, or Ceph.

In the future I want to do more: I eventually want to enable mid-stream behavioral constraint. That's why I tried to keep the implementation (ChatMemorySimple) separated from the interface that enables it (which I should probably rename, since it's really an inference hook). The long-term goal is to support external governance scaffolding: tools for hallucination recovery, telos tracking, violation logging, and long-term reasoning, in addition to storing user memories. I suspect that without these kinds of structures, persistent memory features will always be fragile unless reinforced through fine-tuning or runtime constraint.

This is an attempt to prototype a real runtime cognition layer, not just simulate memory within the model's weights, and it's my first stab at moving some of this from model-level simulation into real code using real storage. If this is something you think might be worth pursuing, I would love to figure out a lightweight way to tie into the inference loop. That's the key piece I believe I need, since I'm not sure I can do everything completely externally.
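To make that concrete, here is a stripped-down sketch of the kind of namespaced key/value interface I have in mind (the names are illustrative only, not the exact classes in this PR):

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>

// Illustrative only: a namespaced key/value memory interface. The backend
// shown here is an in-process std::map, but it could just as easily be
// sqlite3, S3, or Ceph behind the same interface.
class chat_memory {
public:
    virtual ~chat_memory() = default;

    virtual void set(const std::string & ns, const std::string & key, const std::string & value) = 0;
    virtual std::optional<std::string> get(const std::string & ns, const std::string & key) const = 0;
    virtual void erase(const std::string & ns, const std::string & key) = 0;
};

// Demo backend, analogous in spirit to ChatMemorySimple in this PR.
class chat_memory_simple : public chat_memory {
    std::map<std::pair<std::string, std::string>, std::string> store;
public:
    void set(const std::string & ns, const std::string & key, const std::string & value) override {
        store[{ns, key}] = value;
    }
    std::optional<std::string> get(const std::string & ns, const std::string & key) const override {
        const auto it = store.find({ns, key});
        if (it == store.end()) {
            return std::nullopt;
        }
        return it->second;
    }
    void erase(const std::string & ns, const std::string & key) override {
        store.erase({ns, key});
    }
};
```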
Aren't these just prompt engineering?
Which missing API calls from …?
@ngxson Thank you! My background is in storage and this is my first dive into the llama.cpp code, so I confess I'm still working to understand exactly what I need. I believe it might look something like the rough sketch at the end of this comment, though.
Most of what I need is already there and being used in the ChatMemory interface in this PR. On reflection, it might be better to rename it to something like "InferenceHook". The core idea here is to see if this kind of inference-aware runtime behavior shaping could be an optional path forward. I 100% agree that it needs to be opt-in and lightweight, though. My hope is that this could allow a huge amount of flexibility for future developers.
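Very roughly, the hook surface I'm picturing looks like this (purely a sketch; none of these are existing libllama or common APIs):

```cpp
#include <string>

// Purely a sketch of hypothetical hook points; nothing here exists in
// libllama or common today.
struct inference_hook {
    virtual ~inference_hook() = default;

    // Called before the prompt is submitted, so stored memories or
    // instructions can be injected into the context.
    virtual std::string on_prompt(const std::string & prompt) { return prompt; }

    // Called once a full response has been generated, so tool-style
    // commands (e.g. key/value store operations) can be parsed out.
    virtual std::string on_response(const std::string & response) { return response; }

    // Optionally called per generated token, so a hook could watch the
    // partial output and decide whether a mid-stream correction is needed.
    virtual void on_token(const std::string & piece) { (void) piece; }
};
```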
Update: In parallel, I'm working on using the same interface for a governance model where I re-inject feedback into the next prompt based on the previous response. This works, but at least with Gemma 3 it doesn't consistently override undesirable behavior, so I'm now working to learn how logit biases work and where I could potentially modify them. My current goal is to create per-session, in-order tracking of tasks so I can then do things like compare responses, set up dynamic logit biases, look at drift, etc. I believe I can do this from within …
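For the logit-bias piece, the core operation I have in mind is simple: before sampling, add a per-session bias to the logits of the tokens I care about. A minimal sketch of just that step (the surrounding hook point and the names are hypothetical):

```cpp
#include <unordered_map>

#include "llama.h" // for llama_token

// Hypothetical per-session state: token id -> additive bias, updated after
// each response based on how well the model followed the instructions.
using session_logit_bias = std::unordered_map<llama_token, float>;

// Add the session's dynamic biases to the raw logits for the position
// about to be sampled. `logits` points at n_vocab floats.
static void apply_session_bias(float * logits, int n_vocab, const session_logit_bias & bias) {
    for (const auto & [tok, b] : bias) {
        if (tok >= 0 && tok < n_vocab) {
            logits[tok] += b;
        }
    }
}
```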
Update: I've spent a fair amount of time trying to figure out ways to get the model I've been testing with (Gemma 3 4B Q8) to regularly adopt the interface. It will sporadically execute key/value store commands, sometimes even searching the store for instructions about how to use the commands (which I've tried placing there as a backup to the system prompt), but it is as likely to make up key/value pairs as it is to actually use the tool. In some cases it will even issue a "list" command and return non-existent keys as its result within the same response. I've also tried to implement mid-stream corrections as a test, but that likewise was a failure. Perhaps a higher-parameter model would do a better job.

@ngxson You were right, it was quite tricky in the end. I don't think this is going to work the way I hoped unless someone has an idea of how to make tool usage more attractive to the model. I even thought about trying to change the probability distribution to favor the KV commands, but it all felt very invasive and brittle. I also went down a bit of a labyrinth trying to enforce the rules via very elaborate prompt engineering (while learning how loaded some of the words I was using earlier are!), but the longer sessions go on, the worse the model gets rather than better (as you alluded to). Perhaps the other approaches you mentioned would work better.
Closing this for now, since I don't see it having a high likelihood of success as-is. Will open a new PR if that changes.
This is a rough proof-of-concept for implementing a chat memory interface inspired by ChatGPT's memories feature. It is separated into 3 parts: a ChatMemory interface and base class, a simple in-memory implementation (ChatMemorySimple), and the hooks into the existing main/server code that wire it in.
A key goal for this POC was to minimize the changes to main/server and keep as much of the logic in the chat-memory classes as possible. One specific change that was necessary, for instance, was to pass the conv_id from the webui back to the server so that each session has its own memory. Per-user or per-group memories could potentially be implemented as well. A future goal for this project would be to allow integration with local databases, S3, and Ceph to store these memories persistently.
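Conceptually, the per-session wiring is just a map keyed by the conversation id (sketch only; the actual code in this PR differs in the details):

```cpp
#include <map>
#include <string>

// Sketch only: each webui conversation id gets its own key/value store.
// Per-user or per-group memory would look the same with a different key.
static std::map<std::string, std::map<std::string, std::string>> memory_by_conv;

static std::map<std::string, std::string> & memory_for(const std::string & conv_id) {
    // operator[] default-constructs an empty store the first time a
    // conversation id is seen.
    return memory_by_conv[conv_id];
}
```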
The simple implementation has a lot of code dedicated to trying to keep the model from hallucinating about the state of the memory, with limited success. The model used for testing is Gemma 3 4B Q8, and it aggressively prefers to trust its own training and make up fake statistics. It's possible that larger or other models may behave better; however, this will need active work and may require specialized training to work consistently.
In addition to the above issue (among others!), this POC has several deficiencies:
My goal before taking this any further is to solicit feedback from ggml and the greater community to see if this project merits continued development. While the vast majority of the code is in ChatMemorySimple, the more important pieces to focus on, IMHO, are the interface, the base class, and the modifications to the existing code.